Jesse Ekweozoh, Nistha Mitra and William Jonathan Faoro
2020 has been historic in more ways than one. One of the most crucial happenings was the Black Lives Matter movement. The wrongful death of George Floyd while in police custody in Minneapolis served as a catalyst for a global uprising of the BLM movement. George Floyd's death became a face for the innumerable deaths of BIPOC at the hands of the police. Massive outcry on social media and in-person protests sparked the beginning of sweeping change, from police department budget cuts to resignations and the removal of monuments and statues.
The central objective of this project is to analyze data on fatal shootings by the police, using data analysis tools and techniques, to understand the disproportionate impact on Black, Indigenous and People of Color in the United States between 2015 and 2020. We aimed for our work to be a practical tutorial on data visualization and analysis, with an in-depth, step-by-step description of the code and the logic. Moreover, amid ongoing discussions of racism, we hope to shed light on a situation that impacts all of us.
You will need the following libraries for this project:
Read through the following resources for more information about pandas/installation and python 3.6 in general:
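If any of the libraries are missing from your environment, they can typically be installed with pip (the package names below simply match the imports used later in this tutorial; adjust the command for your setup, e.g. conda):

```shell
# Install the packages this tutorial imports (pandas, numpy, seaborn,
# matplotlib, requests, sidetable, folium); requires Python 3.6+.
python -m pip install pandas numpy seaborn matplotlib requests sidetable folium
```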
Data collection is a systematic process of gathering observations or measurements. While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider what you aim to measure and what sources you will draw on.
Below we have the code to import all the necessary packages we will need in this project.
import pandas as pd              # dataframes
import numpy as np               # numerical operations
import seaborn as sns            # statistical visualization
import matplotlib.pyplot as plt  # plotting
import requests                  # fetching data over HTTP
import sidetable                 # frequency tables (df.stb accessor)
import warnings
warnings.filterwarnings("ignore")
Our research aim has been described above. One of the main datasets we will use, and one that supports our thesis well, comes from The Washington Post, which released a dataset of fatal shootings by police in the US between 2015 and 2020. The dataset is available in this Washington Post repo: https://github.com/washingtonpost/data-police-shootings
Another dataset we use describes the county- and state-level results of the 2016 presidential election. We will use it to analyze the relation between political affiliation and police shootings. You can access the raw data from https://raw.githubusercontent.com/tonmcg/US_County_Level_Election_Results_08-20/master/2016_US_County_Level_Presidential_Results.csv
Lastly, we use a dataset of state names and abbreviations. We will use it to modify the original dataset and make it more readable. https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv
df = pd.read_csv("https://github.com/washingtonpost/data-police-shootings/releases/download/v0.1/fatal-police-shootings-data.csv")
df1 = pd.read_csv("https://raw.githubusercontent.com/tonmcg/US_County_Level_Election_Results_08-20/master/2016_US_County_Level_Presidential_Results.csv")
df3= pd.read_csv("https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv")
A DataFrame is a structure similar to a table or matrix, with rows and columns that contain certain data. Pandas allows us to easily perform a lot of manipulations on DataFrames through the use of their functions. You can find more info at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html.
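As a quick illustration, you can build a small DataFrame by hand from a dictionary of columns (the rows below are made up for the example, not taken from our datasets):

```python
import pandas as pd

# A tiny illustrative DataFrame: each dictionary key becomes a column.
example = pd.DataFrame({
    "state": ["CA", "TX", "FL"],
    "fatalities": [10, 8, 6],
})
print(example.shape)             # (3, 2) -> 3 rows, 2 columns
print(example.columns.tolist())  # ['state', 'fatalities']
```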
Below you can see the data, which describes each shooting and the information known about it. df.head() shows the top few rows, which helps us see the structure of the data and its columns.
print("Fatal shooting dataset has {}".format(df.shape[0]),
"rows and {}".format(df.shape[1]), "columns")
df.head(2)
The following dataset gives us the results of the 2016 election. We don't need the entire dataset; it will be cleaned later according to our needs.
print("Election dataset has {}".format(df1.shape[0]),
      "rows and {}".format(df1.shape[1]), "columns")
df1.head(57)
This dataset mainly adds convenience when reading a plot: abbreviations are compact, but in a plot with multiple variables, not having to remember every code helps.
df3=df3.rename(columns = {'Abbreviation':'state', 'State':'state_name'})
df3.head(3)
After collection comes processing. By this we mean everything from data cleaning, data wrangling, and data formatting to data compression (for efficient storage) and data encryption (for secure storage).
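As a toy illustration of the kinds of cleaning steps this section walks through (the frame, column names, and values below are invented for the example, not taken from our datasets):

```python
import numpy as np
import pandas as pd

# A toy frame with typical problems: a missing value and inconsistent casing.
toy = pd.DataFrame({"Age": [25.0, np.nan, 40.0], "State": ["ca", "tx", "ca"]})
toy = toy.rename(columns=str.lower)                  # consistent column names
toy["age"] = toy["age"].fillna(toy["age"].median())  # impute the missing age
toy["state"] = toy["state"].str.upper()              # standardize formatting
```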
The 'join' function joins columns with another DataFrame, either on the index or on a key column. You can efficiently join multiple DataFrame objects by index at once by passing a list.
Here, we join the first dataset to the third to add the state names.
df= df.join(df3.set_index('state'), on='state')
df.head(3)
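The "passing a list" variant mentioned above can be sketched on toy frames (the frames and column names below are invented for the example):

```python
import pandas as pd

# Three frames sharing the same index; join() accepts a list of frames
# and aligns them all by index in one call.
a = pd.DataFrame({"x": [1, 2]}, index=["CA", "TX"])
b = pd.DataFrame({"y": [3, 4]}, index=["CA", "TX"])
c = pd.DataFrame({"z": [5, 6]}, index=["CA", "TX"])
joined = a.join([b, c])
print(joined.columns.tolist())  # ['x', 'y', 'z']
```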
The body camera column is not useful for our analysis; moreover, as you can see below, it mostly holds a single value. We drop that column.
df.body_camera.value_counts()
The names are too personal to use and might infringe on individual privacy, and the id column is redundant. We therefore drop these columns.
df.drop(['id','name', 'body_camera'], axis=1, inplace=True)
df.head(2)
“Armed”, “age”, “gender”, “race”, and “flee” columns have missing values.
df.isna().sum()
A more informative tool for inspecting missing values is the Seaborn heatmap.
plt.figure(figsize=(10,7))
sns.heatmap(df.isnull(), cbar = False, cmap = 'viridis')
The “flee” and “armed” columns describe the action of the person being shot.
df.flee.value_counts()
df.armed.value_counts()
The action "not fleeing" dominates the "flee" column, so we fill its missing values with "Not fleeing". Similarly, we fill missing "armed" values with the most common value. You can choose a different way to handle them, such as dropping those rows.
df.armed.fillna(df.armed.value_counts().index[0], inplace=True)
df.flee.fillna('Not fleeing', inplace=True)
We drop any remaining rows with missing values, because the other columns describe the person being shot, and it would be misleading to make assumptions without accurate information.
df.dropna(axis=0, how='any', inplace=True)
print("There are {}".format(df.isna().sum().sum()), "missing values left in the dataframe")
The following heatmap shows that there are no more missing data in the modified dataset.
plt.figure(figsize=(10,7))
sns.heatmap(df.isnull(), cbar = False, cmap = 'viridis')
The following copy is made for later use. The further changes made to df will not be required there, so we copy and store it now.
locationInfo = df.copy()
We convert the date column to the 'datetime' type, pandas' data type for handling dates. After converting, we extract the year from each date into a new column, which we will use to look at yearly shooting rates.
df['date'] = pd.to_datetime(df['date'])
df['year'] = pd.to_datetime(df['date']).dt.year
The following code groups the rows by year and state, so that each state's shooting count can be seen per year.
df.insert(2, 'Count_per_year', df.groupby(['year','state'])['year'].transform('size'))
df.head(2)
We make a dataset with just fatality counts per state over the five years: grouping by state adds up all occurrences regardless of other attributes, giving us the totals.
kill_st=df.groupby(['state']).size().reset_index(name='counts')
kill_st.head(3)
We now clean df1 (the second dataset), which holds the election information. We drop the unnecessary columns and rename the key column to match the dataframe we will join it with; in this case we will join it with kill_st, so we rename 'state_abbr' to 'state'.
df1.drop(['Unnamed: 0','total_votes','per_dem','per_gop','diff','per_point_diff','combined_fips','county_name'], axis=1, inplace=True)
df1=df1.rename(columns = {'state_abbr':'state'})
We join it with the third dataframe to add the names of the states, giving us overall state votes.
df1= df1.join(df3.set_index('state'), on='state')
df1.head(3)
The following code sums each county's votes by state to give the total votes for Democrats and Republicans. We then create a column that compares the two totals per state and labels the state Red or Blue (Republican or Democrat).
df1 = df1.groupby(['state']).agg({'votes_dem' : 'sum', 'votes_gop': 'sum'}).reset_index()
df1['pol_m'] = np.where(df1['votes_dem'] > df1['votes_gop'], "Blue", "Red")
Finally, merging the kills-per-state dataframe with the dataframe above tells us whether each state is Blue or Red and how many police shooting fatalities it saw over the five years.
by_state = pd.merge(df1, kill_st, on='state')
by_state= by_state.join(df3.set_index('state'), on='state')
by_state.head()
We copy the original dataframe and drop duplicate rows on year and state, because each person who died in a given year and state carries the same count. We end up with a dataset that tells us how many people died per year in each state.
df2= df.copy()
df2 = df2.drop(['manner_of_death','date','armed','age','gender','race','city','signs_of_mental_illness','threat_level','flee','longitude','latitude','is_geocoding_exact'], axis=1)
# dropping duplicate values
df2= df2.drop_duplicates(['year','state'],keep= 'last')
df2.reset_index(inplace=True)
df2.drop(['index'], axis=1, inplace=True)
df2= df2.sort_values(['state_name'])
df2.tail(5)
by_state.set_index('state_name', inplace=True)
by_state.head()
In this section we visualize the data by state. We use matplotlib to make a pie chart that shows how many fatalities occurred in each state over the last five years.
ax=by_state['counts'].plot(kind='pie', figsize=(20, 15))
ax.set_aspect('equal')
ax.yaxis.set_label_coords(-0.15, 0.5)
plt.show()
The pie chart was helpful, but to make things clearer and to find the average number of fatalities across all states, we make a bar plot. We use the seaborn package to display the bar plot and aggregation to find the mean.
mean= by_state["counts"].mean()
mean
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 11)
plt.title("Kills by State")
fig = sns.barplot(y=by_state.index, x=by_state["counts"])
# adding a vertical line for the mean
ax.axvline(mean, color="blue", linewidth=2)
We can clearly see that certain states, like California, Florida and Texas, have high numbers of fatalities by the police. However, we don't see a clear trend based on states alone.
We therefore plot each state by year to see whether there are trends common to all states. We do this by pivoting our dataframe so that the index becomes the years and the columns become the states. After that, we plot each column by iterating through the dataframe.
year_rate = df2.pivot(index='year', columns='state_name', values='Count_per_year')
for i, col in enumerate(year_rate.columns):
    year_rate[col].plot(fig=plt.figure(i))
    plt.title(col)
plt.show()
Again, we do not see any clear trend in kills over the years across the states.
Below, we list the states with the most kills for reference against the plots.
df.stb.freq(['state'], thresh=50)
Since we couldn't see any clear trend based on individual states, we group the states by their political affiliation in the 2016 election. We take by_state and use groupby with an aggregate function to find the total kills across all blue states and all red states. After that, we use a bar plot to show which group of states has more police shooting fatalities.
by_pol= by_state.groupby(['pol_m']).agg({'counts' : 'sum'}).reset_index()
by_pol.reset_index(drop=True)
br= by_pol.plot(kind='bar',x='pol_m',y='counts',
color=["blue","red"])
br.set_xlabel("Democrat Vs Republican")
br.set_ylabel("Fatal Police Shooting")
We see that fatalities in the red states are significantly higher than in the blue states. Keep in mind that in the 2016 election the Republican presidential candidate was Donald Trump. Next, we look at the age distribution of the people who were shot.
plt.figure(figsize=(12,8))
plt.title('Age Distribution of Deaths', fontsize=15)
sns.distplot(df.age)
In this section we visualize the data in terms of location and the identities of the individuals. We use the folium package to show where shootings are concentrated in the United States and which identities dominate the data.
import folium
from folium import plugins
from folium.plugins import HeatMap
locationInfo.head(3)
def per_race(r):
    color = ''
    if r == 'B':
        color = 'black'
    elif r == 'W':
        color = 'white'
    elif r == 'U':
        color = 'green'
    elif r == 'A':
        color = 'yellow'
    elif r == 'O':
        color = 'blue'
    elif r == 'H':
        color = 'purple'
    elif r == 'N':
        color = 'crimson'
    return color
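The if/elif chain above can also be written as a dictionary lookup. This sketch (the `RACE_COLORS` table and `per_race_dict` name are our own, not part of the tutorial's code) behaves the same way, returning an empty string for unmapped codes:

```python
# Race code -> marker color mapping, equivalent to the per_race function.
RACE_COLORS = {'B': 'black', 'W': 'white', 'U': 'green', 'A': 'yellow',
               'O': 'blue', 'H': 'purple', 'N': 'crimson'}

def per_race_dict(r):
    # .get falls back to '' for any unmapped code, matching the original.
    return RACE_COLORS.get(r, '')
```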
map_PoliceKillings = folium.Map()
cluster = folium.plugins.MarkerCluster(name="Fatal Police Shooting").add_to(map_PoliceKillings)
for person in locationInfo.itertuples():
    lat = person.latitude
    long = person.longitude
    if person.gender == 'M':
        sex_c = 'blue'
    else:
        sex_c = 'pink'
    race_c = per_race(person.race)
    sex = "Sex: {}".format(person.gender)
    age = "Age: {}".format(person.age)
    race = "Race: {}".format(person.race)
    armed = "Armed status: {}".format(person.armed)
    # folium popups render HTML, so use <br> rather than \n for line breaks
    content = sex + "<br>" + age + "<br>" + race + "<br>" + armed
    newMarker = folium.Marker([lat, long], popup=content, icon=folium.Icon(color=sex_c, icon_color=race_c))
    newMarker.add_to(cluster)
map_PoliceKillings
We see that more white Americans have been shot than any other race. This raises the question: was our thesis even right?
We asked specifically whether BIPOC are impacted more by fatal police shootings, yet our map showed more white people dying. However, we realize that we left out an integral factor in our visualization: the populations of these races.
We will use the 2019 populations, which are available on the US Census website. Although the ratios have changed between 2015 and 2020, the change is not dramatic; the ratios likely remain within a few percentage points. However, you can use the exact population for each year to be more accurate.
df_pop = pd.DataFrame({'race':['W','B','A','H','N','O'],
'population':[0.601, 0.134, 0.059, 0.185, 0.013, 0.008]})
df_pop['population'] = df_pop['population']*328  # shares of the ~328 million US population, giving millions of people
df_pop
df_race = df[['race','year','armed']].groupby(['race','year']).count().reset_index()
df_race.rename(columns={'armed':'number_of_deaths'}, inplace=True)
df_race.head(4)
df_race = pd.merge(df_race, df_pop, on='race')
df_race['deaths_per_million'] = df_race['number_of_deaths'] / df_race['population']
df_race.head()
plt.figure(figsize=(12,8))
plt.title("Fatal Shootings by Police", fontsize=15)
sns.barplot(x='year', y='deaths_per_million', hue='race', data=df_race )
Our analysis employed standard Python libraries to import, modify and analyze our chosen data in support of our central thesis. We used three datasets: fatal police shootings, county- and state-level results of the 2016 election, and a list of states and abbreviations. Together, these let us draw a relation between political affiliation and police shootings. We performed several data manipulation techniques to clean our data for analysis, and data visualization was a useful tool for describing our findings: we used standard Python plotting libraries such as matplotlib for pie charts and the seaborn package for bar plots. Finally, we analyzed the number of deaths by race and found, based on our population-adjusted bar plot, that Black people had the highest per-capita death rate from police shootings in America.